KV cache
KV cache trades space for time
Without KV cache: when generating token N, we need to
- recompute the Key and Value matrices for all previous tokens (1 through N-1)
- compute attention using these Keys and Values
With KV cache:
- store the Key and Value vectors for each token after computing them once
- when generating new tokens, reuse the cached K,V vectors from previous tokens
- only compute new K,V vectors for the current token being generated
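The cached decode step above can be sketched for a single attention head. This is a toy NumPy illustration, not any particular model's implementation; the projection matrices and head dimension are made-up assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy head dimension (assumption)

# Hypothetical projection matrices; in a real model these come from trained weights.
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

cache_k, cache_v = [], []  # the KV cache: one K and one V vector per past token

def decode_step(x):
    """Attend the new token's query over cached K,V plus its own K,V."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    cache_k.append(k)           # compute K,V once for this token...
    cache_v.append(v)           # ...and reuse them on every later step
    K = np.stack(cache_k)       # (t, d): past K,V are read, never recomputed
    V = np.stack(cache_v)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V          # attention output for the current token

for _ in range(5):
    out = decode_step(rng.standard_normal(d))
```

Each step does O(t) work against the cache instead of recomputing K,V for all t previous tokens, which is exactly the space-for-time trade.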
Optimization techniques:
- MQA/GQA (Multi-Query / Grouped-Query Attention): reduce the KV cache by sharing K,V across heads
- MLA (Multi-head Latent Attention): compresses K,V into low-rank latent representations
- Sliding window: only cache the most recent tokens
- KV cache quantization: store K,V in lower precision (e.g., INT8)
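The savings from these techniques follow directly from the KV cache size formula (2x for Keys and Values, times layers, KV heads, head dimension, sequence length, and bytes per element). A small sketch, using a made-up Llama-2-7B-like configuration (32 layers, head dimension 128) purely as an assumption for illustration:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes, batch=1):
    # 2x accounts for storing both Keys and Values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes * batch

# Hypothetical config: 32 layers, head_dim 128, 4096-token context, fp16 (2 bytes)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                     seq_len=4096, dtype_bytes=2)  # full multi-head attention
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                     seq_len=4096, dtype_bytes=2)  # GQA with 8 KV heads

print(mha / 2**30, "GiB vs", gqa / 2**30, "GiB")  # GQA: 4x smaller cache
```

Going from 32 to 8 KV heads cuts the cache by 4x, and quantizing fp16 to INT8 would halve it again; the levers multiply.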
Backlinks
- Attention Mechanism Variants: "compress Keys and Values into a low-rank latent space, reducing KV cache requirements"